Business Context. The ability to price an insurance quote properly has a significant impact on insurers' management decisions and financial statements. You are the chief data scientist at a new startup insurance company focused on providing affordable insurance to millennials. You are tasked with assessing the current state of the insurance market to see which factors large insurance providers charge premiums for. Fortunately for you, your company has compiled a dataset by surveying what people currently pay for insurance from large companies. Your findings will be used as the basis for developing your company's millennial car insurance offering.
Business Problem. Your task is to build a minimal model to predict the cost of insurance from the data set using various characteristics of a policyholder.
Analytical Context. The data resides in a CSV file which has been pre-cleaned for you and can directly be read in. Throughout the case, you will be iterating on your initial model many times based on common pitfalls that arise which we discussed in previous cases. You will be using the Python statsmodels package to create and analyze these linear models.
### Load relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import os
import string
import re
import sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
## For a map
import branca
import geopandas
import scipy
from scipy import stats
import folium # package for making maps, please make sure to use a version older than 1.0.0.
from IPython.display import display
from folium.plugins import TimeSliderChoropleth
# This magic command displays plots inline without needing plt.show()
%matplotlib inline
# always make it pretty
sns.set_style('whitegrid')
plt.style.use('seaborn-whitegrid')
df = pd.read_csv('Allstate-cost-cleaned.csv', index_col=0,
dtype = { # indicate categorical variables
'A': 'category',
'B': 'category',
'C': 'category',
'D': 'category',
'E': 'category',
'F': 'category',
'G': 'category',
'car_value': 'category',
'state': 'category'
}
)
The following are the columns in the dataset:
print(df.shape)
df.head(10)
Write code to visualize the relationship between cost and the following variables. Choose your plots judiciously based on what you know about each variable. Different variable types (categorical vs. numerical) should have different types of plots (e.g. scatter, boxplot, violin plot, etc.) Group your plots together using the plt.subplot() function.
car_age, age_oldest, age_youngest, duration_previous, C_previous, homeowner, group_size, car_value

Answer.
fig = plt.figure(figsize=(12,10))
vars_to_look = ['car_age', 'age_oldest', 'age_youngest', 'duration_previous',
                'C_previous', 'homeowner', 'group_size', 'car_value']
for i, var in enumerate(vars_to_look):
    plt.subplot(3, 3, i+1)
    if var in ['homeowner', 'C_previous', 'group_size', 'car_value']:
        # note: boxprops/flierprops are boxplot-specific kwargs, so they are not passed here
        sns.violinplot(x=df[var], y=df['cost'], palette='viridis')
        plt.title("Violin plot of " + var)
    else:
        sns.regplot(x=df[var], y=df['cost'], lowess=True,
                    line_kws={'color': 'red', 'ls': '--'},
                    scatter_kws={'s': 10, 'alpha': 0.5, 'color': 'blue'})
        plt.title("Scatterplot of " + var)
fig.set_tight_layout(True)
fig = plt.figure(figsize=(14,6))
vars_to_look = list(string.ascii_uppercase[:7])
for i, var in enumerate(vars_to_look):
    plt.subplot(2, 4, i+1)
    sns.violinplot(x=df[var], y=df['cost'], palette='viridis')
    plt.title("Violin plot of coverage option: " + var)
fig.set_tight_layout(True)
Clear relationships with cost are visible for car_age, age_oldest, age_youngest, and duration_previous. For the rest of the variables, no important relationships were observed.
Convert all categorical data to be in the one-hot encoding format.
Answer.
cat_variables = df.select_dtypes(exclude=['number', 'bool']).columns
df_cat = pd.get_dummies(df, columns=cat_variables, drop_first=True)
df_cat
Split your data into training and testing sets (an 80-20 split is a good starting point).
Note: Keep random seed as 1337 in the code cell
Answer.
#
train, test = train_test_split(df_cat, test_size=0.2, random_state=1337)
[print(i,'shape \t->', eval(i).shape) for i in ['train', 'test']];
Fit a multiple linear regression model to the training data regressing cost against all the other variables. Call this model_all. What is the AIC value?
Answer.
ls = list(train.columns) # List all column names
ls.remove('cost') # Remove 'cost' variable
model_all = smf.ols(formula = 'cost ~ ' + ' + '.join(ls),
data = train).fit()
model_all.summary()
print("The AIC for model_all is {:.2f}".format(model_all.aic));
According to model_all, which states are most and least expensive?
Answer.
# Select all parameters from the OLS model
dummy = model_all.params.reset_index(name='Estimate')
dummy = dummy[dummy['index'].map(lambda x: 'state' in x)].sort_values('Estimate', ascending=False) # Select state
# Summarise findings
print('-'*40); print('Most Expensive States'); print('-'*40)
print(dummy.head());
print('-'*40); print('Least Expensive States'); print('-'*40)
print(dummy.tail())
# Strip the 'state_' prefix to get state abbreviations
dummy['abbr'] = dummy['index'].apply(lambda x: re.sub(r'state_','', x))
# Open the downloaded georeference and region-tagging data for states
US_states = geopandas.read_file("./gz_2010_us_040_00_5m.json", driver="GeoJSON")
state_regions = pd.read_csv("./state_regions.csv")
# Transforming geopandas and merging data
US_states1 = US_states\
.merge(state_regions, how='left', left_on='NAME', right_on='State')\
.merge(dummy.drop(columns='index'), how='left', left_on='State Code', right_on='abbr')
# Fill NaN values - there are some states not contained in the dataset
US_states1['Estimate'] = US_states1['Estimate'].fillna(0)
min_cn, max_cn = US_states1['Estimate'].quantile([0.01, 0.99]).round(2)
colormap = branca.colormap.LinearColormap(colors=['darkred','red','white','blue','darkblue'], vmin=min_cn, vmax=max_cn )
colormap.caption = "Regression estimates of insurance cost"
USmap = folium.Map(location=[38, -102], zoom_start=4.45, tiles="OpenStreetMap")
style_function = lambda x: {
'fillColor': colormap(x['properties']['Estimate']),
'color': 'black',
'weight':1,
'fillOpacity':0.5
}
stategeo = folium.GeoJson(
US_states1.to_json(), name='USmap', style_function=style_function,
tooltip=folium.GeoJsonTooltip(
fields=['State', 'Region', 'Estimate'],
aliases=['State', 'Region', 'Coefficient'],
localize=True)).add_to(USmap)
colormap.add_to(USmap)
USmap
Interpret the coefficients of group_size, homeowner, car_age, risk_factor, age_oldest, age_youngest, married_couple, and duration_previous. Do the signs and values of these coefficients make sense to you in the context of this business problem?
Answer.
dummy = model_all.params.reset_index(name='Estimate')
vars_to_look = ['group_size', 'homeowner', 'car_age', 'risk_factor', 'age_oldest', 'age_youngest',
'married_couple', 'duration_previous']
# vars_to_look
dummy[dummy['index'].map(lambda x: x in vars_to_look)]
The variable group_size refers to the number of people to be covered under the policy.
Which variables from model_all are statistically significant? (For categorical variables, consider them to be significant if at least one of their categories are statistically significant). Refit the model using only these variables; call this model_sig. How does this model compare to the previous model?
Answer.
# dummy = model_all.params.reset_index(name='Estimate')
dummy = pd.concat([model_all.params, model_all.bse, model_all.tvalues, model_all.conf_int(), model_all.pvalues], axis=1)
dummy.columns = ['Params', 'SE', 'T', 'IC_LI', 'IC_LS', 'Pval']
dummy = dummy[dummy['Pval'] < 0.05]
dummy = dummy[(np.sign(dummy['IC_LI']) * np.sign(dummy['IC_LS'])) > 0]
# ', '.join(dummy.index)
dummy['IC'] = (dummy['Params'] - dummy['IC_LI']) # Intervals are symmetric
dummy
dummy1 = dummy.sort_values('Params').reset_index()
dummy1 = dummy1[dummy1['index'] != 'Intercept']
# sns.color_palette("rocket", as_cmap=True)
# print(dummy1)
fig, axs = plt.subplots(1, 3, sharex=True, figsize = (13,4))
axs[0].errorbar(dummy1.iloc[:17]['Params'], dummy1.iloc[:17]['index'],
xerr = dummy1.iloc[:17]['IC'], fmt='o')
fig.suptitle('model_all: parameter estimates (except intercept)', fontsize=16)
axs[0].axvline(color='black', ls='--')
#
axs[1].errorbar(dummy1.iloc[17:34]['Params'],
dummy1.iloc[17:34]['index'],
xerr = dummy1.iloc[17:34]['IC'], fmt='o')
axs[1].axvline(color='black', ls='--')
#
axs[2].errorbar(dummy1.iloc[34:]['Params'],
dummy1.iloc[34:]['index'],
xerr = dummy1.iloc[34:]['IC'], fmt='o')
axs[2].axvline(color='black', ls='--')
#
fig.set_tight_layout(True)
model_all has several significant variables: homeowner, car_age, risk_factor, age_oldest, age_youngest, married_couple, C_previous, duration_previous, state (via its dummy variables), car_value, A, B, C, E, F, and G. These variables are considered significant because they have p-values below 0.05 and confidence intervals that do not cross zero.
Among these, the purely numerical variables are car_age, age_oldest, risk_factor, age_youngest, and duration_previous.
d = model_all.params.reset_index(name='Estimate')
other_vars = ' + '.join(
    d[d['index'].map(lambda x: re.match(r'^state|^car_value|^[ABCEFG]_\d', x) is not None)]['index'])
model_sig = smf.ols(formula = "cost ~ homeowner + car_age + risk_factor + age_oldest"
" + age_youngest + married_couple + C_previous"
" + duration_previous + " + other_vars,
data = train).fit()
model_sig.summary()
# Predictive Performance Metrics
def MAE(prediction, true_values):
    return np.mean(np.abs(prediction - true_values))

def RMSE(prediction, true_values):
    return np.sqrt(np.mean(np.square(prediction - true_values)))

def MAPE(prediction, true_values):
    return np.mean(np.abs((prediction - true_values) / true_values) * 100)
df_Metrics = pd.DataFrame({'Model': ['model_all', 'model_sig']})
df_Metrics['df_model'] = list(map(lambda x: eval(x).df_model, df_Metrics['Model']))
df_Metrics['df_resid'] = list(map(lambda x: eval(x).df_resid, df_Metrics['Model']))
df_Metrics['R2'] = list(map(lambda x: eval(x).rsquared, df_Metrics['Model']))
df_Metrics['R2_adj'] = list(map(lambda x: eval(x).rsquared_adj, df_Metrics['Model']))
df_Metrics['F_pval'] = list(map(lambda x: eval(x).f_pvalue, df_Metrics['Model']))
df_Metrics['AIC'] = list(map(lambda x: eval(x).aic, df_Metrics['Model']))
df_Metrics['BIC'] = list(map(lambda x: eval(x).bic, df_Metrics['Model']))
df_Metrics['Cond_Num'] = list(map(lambda x: eval(x).condition_number, df_Metrics['Model']))
df_Metrics['MAE'] = list(map(lambda x: MAE(eval(x).predict(test), test['cost']), df_Metrics['Model']))
df_Metrics['RMSE'] = list(map(lambda x: RMSE(eval(x).predict(test), test['cost']), df_Metrics['Model']))
df_Metrics['MAPE'] = list(map(lambda x: MAPE(eval(x).predict(test), test['cost']), df_Metrics['Model']))
df_Metrics
In addition to the variables in model_sig, add quadratic terms for age_youngest and car_age, as well as interaction terms between car_value and age_youngest, and save the result to a new model model_sig_plus.
Answer.
model_sig_plus = smf.ols(formula = "cost ~ homeowner + car_age + risk_factor + age_oldest"
" + age_youngest + married_couple + C_previous"
" + I(age_youngest**2) + I(car_age**2) "
" + car_value_b*age_youngest "
" + car_value_c*age_youngest "
" + car_value_d*age_youngest "
" + car_value_e*age_youngest "
" + car_value_f*age_youngest "
" + car_value_g*age_youngest "
" + car_value_h*age_youngest "
" + car_value_i*age_youngest "
" + duration_previous + " + other_vars,
data = train).fit()
model_sig_plus.summary()
df_Metrics = pd.DataFrame({'Model': ['model_all', 'model_sig', 'model_sig_plus']})
df_Metrics['df_model'] = list(map(lambda x: eval(x).df_model, df_Metrics['Model']))
df_Metrics['df_resid'] = list(map(lambda x: eval(x).df_resid, df_Metrics['Model']))
df_Metrics['R2'] = list(map(lambda x: eval(x).rsquared, df_Metrics['Model']))
df_Metrics['R2_adj'] = list(map(lambda x: eval(x).rsquared_adj, df_Metrics['Model']))
df_Metrics['F_pval'] = list(map(lambda x: eval(x).f_pvalue, df_Metrics['Model']))
df_Metrics['AIC'] = list(map(lambda x: eval(x).aic, df_Metrics['Model']))
df_Metrics['BIC'] = list(map(lambda x: eval(x).bic, df_Metrics['Model']))
df_Metrics['Cond_Num'] = list(map(lambda x: eval(x).condition_number, df_Metrics['Model']))
df_Metrics['MAE'] = list(map(lambda x: MAE(eval(x).predict(test), test['cost']), df_Metrics['Model']))
df_Metrics['RMSE'] = list(map(lambda x: RMSE(eval(x).predict(test), test['cost']), df_Metrics['Model']))
df_Metrics['MAPE'] = list(map(lambda x: MAPE(eval(x).predict(test), test['cost']), df_Metrics['Model']))
df_Metrics
Based on the metrics above, the preferred model is model_sig_plus.
To reduce the number of features, it can often be helpful to aggregate the categories; for example, we can create a new variable by assigning each state to a larger region:
state_regions = pd.read_csv('./state_regions.csv')
# download the file above before running this cell
state_regions.head()
Create a new column where a state is replaced with the region it is in according to the above table.
Answer.
df_cat2 = df.merge(state_regions, how='left', left_on='state', right_on='State Code')
#
cat_variables = df_cat2.select_dtypes(exclude=['number', 'bool']).columns
cat_variables = [i for i in cat_variables if i not in ['state', 'State', 'State Code']]
df_cat2 = pd.get_dummies(df_cat2, columns=cat_variables, drop_first=True)
df_cat2.columns
Fit the model as in model_sig_plus but this time use region instead of state. Call this model_region.
Answer.
d = pd.Series(df_cat2.columns)[df_cat2.columns.map(lambda x: re.match(r'^car_value|^[ABCEFG]_\d', x) is not None)]
d = list(d)
d.extend(['Region_Northeast', 'Region_South', 'Region_West'])
other_vars = ' + '.join(d)
train, test2 = train_test_split(df_cat2, test_size=0.2, random_state=1337)
model_region = smf.ols(formula = "cost ~ homeowner + car_age + risk_factor + age_oldest"
" + age_youngest + married_couple + C_previous"
" + I(age_youngest**2) + I(car_age**2) "
" + car_value_b*age_youngest "
" + car_value_c*age_youngest "
" + car_value_d*age_youngest "
" + car_value_e*age_youngest "
" + car_value_f*age_youngest "
" + car_value_g*age_youngest "
" + car_value_h*age_youngest "
" + car_value_i*age_youngest "
" + duration_previous + " + other_vars,
data = train).fit()
model_region.summary()
What should we do next to minimize features?
Answer.
Using a method of your choice, find the numerical feature(s) in model_region, except for the three we added in Exercise 6, which exhibit multicollinearity. Hint: consider looking at correlations.
Answer.
M_df = pd.DataFrame(data = model_region.model.data.exog,
columns = model_region.model.data.param_names)
M_df.drop(columns = ['I(age_youngest ** 2)', 'I(car_age ** 2)',
'car_value_b:age_youngest', 'car_value_c:age_youngest',
'car_value_d:age_youngest', 'car_value_e:age_youngest',
'car_value_f:age_youngest', 'car_value_g:age_youngest',
'car_value_h:age_youngest', 'car_value_i:age_youngest', 'Intercept'], inplace=True)
plt.figure(figsize=(15,13))
sns.heatmap(M_df.corr(), cmap="RdYlBu",
annot=True, square=True,
vmin=-0.8, vmax=0.8, fmt="+.1f",
annot_kws={'fontsize':'x-small'})
plt.title("model_region: correlations between predictors \n",
fontdict={'fontsize':20, 'verticalalignment':'bottom'});
corr_df = pd.melt(M_df.corr().reset_index(), id_vars='index')\
.rename(columns={'index':'var1', 'variable':'var2'})\
.dropna()\
.query('value != 1')\
.sort_values('value')
corr_df['dat1'] = [M_df.loc[:, [x, y]] for x,y in zip(corr_df['var1'], corr_df['var2'])]
# Run a bootstrap to get confidence intervals for each pairwise correlation
from sklearn.utils import resample

def customBootstrap(data, samples, alpha=0.05):
    size = data.shape[0]
    B_stats = list()
    for i in range(samples):
        idx = resample(list(range(size)), n_samples=size)  # sample row indices with replacement
        sampled = data.iloc[idx, ]                         # select the sampled rows
        B_stats.append(sampled.corr().iloc[0, 1])          # statistic: correlation of the pair
    upper = np.percentile(B_stats, (1.0 - alpha / 2.0) * 100)
    lower = np.percentile(B_stats, (alpha / 2.0) * 100)
    return (lower, upper)

corr_df['B_CI'] = corr_df['dat1'].map(lambda x: customBootstrap(x, 100))
#
corr_df['LI'] = corr_df['B_CI'].map(lambda x: round(x[0],3))
corr_df['LS'] = corr_df['B_CI'].map(lambda x: round(x[1],3))
corr_df.drop(columns = ['dat1', 'B_CI'], inplace=True)
corr_df.reset_index(drop=True, inplace=True)
# corr_df.to_csv('corr_data.csv')
# corr_df = pd.read_csv('./corr_data.csv', index_col=0)
# Correlation: (752/992) 75.8% of the variable pairs have a significant correlation
corr_df = corr_df[(np.sign(corr_df.LI) * np.sign(corr_df.LS)) > 0]
corr_df
## Summary of variables with highest correlation
print(pd.concat([corr_df.head(10).reset_index(drop=True),
corr_df.tail(10).reset_index(drop=True)], axis=1, ignore_index=False))
Refit model_region after dropping these redundant predictor(s); call this model_region_no_oldest.
Answer.
The variables age_oldest, A_1, A_2, B_1, C_2, C_3, C_4, E_1, F_1, F_2, F_3, G_2, G_3, G_4 show high correlation with other variables; removing them could reduce the condition number.
df_cat2 = df.merge(state_regions, how='left', left_on='state', right_on='State Code')
train, test3 = train_test_split(df_cat2, test_size=0.2, random_state=1337)
model_region_no_oldest = smf.ols(formula = "cost ~ homeowner + car_age + risk_factor "
" + married_couple + C_previous"
" + I(age_youngest**2) + I(car_age**2) "
" + car_value * age_youngest "
" + duration_previous ",
data = train).fit()
model_region_no_oldest.summary()
What would you do to diagnose the model_region_no_oldest fit? What does this diagnosis suggest to you? (Hint: try plotting the residuals in various ways.)
Answer.
def diagnosticPlot(model):
    res1 = model.resid
    sca1 = model.scale
    yobs = model.model.data.endog
    rangeResid = np.linspace(res1.min(), res1.max(), num=1000)
    fig, axs = plt.subplots(2, 2, figsize=(14, 8))
    axs[0][0].hist(res1, density=True, bins=100, label="residuals")
    axs[0][0].plot(rangeResid, scipy.stats.norm.pdf(rangeResid, loc=0.0, scale=np.sqrt(sca1)),
                   label="normal distribution", ls='--')
    # flag residuals more than 4 standard deviations from zero
    outliers = np.abs(res1) > 4 * np.sqrt(sca1)
    sns.rugplot(res1[outliers], color="g", label="outliers", ax=axs[0][0])
    axs[0][0].legend(loc="upper right")
    axs[0][0].set(title='Residual Distribution Plot')
    sm.qqplot(res1, line="s", marker='.', alpha=0.5, ax=axs[0][1])
    axs[0][1].set(title='QQ Plot of Residuals')
    axs[1][0].plot(yobs, res1, ".", alpha=0.2)
    axs[1][0].set(ylim=(-300, 300), xlabel="Cost", ylabel="Residuals", title='Residuals vs DV')
    axs[1][0].axhline(0, color='black', ls='--')
    sm.graphics.plot_leverage_resid2(model, ax=axs[1][1])
    fig.set_tight_layout(True)
diagnosticPlot(model_region_no_oldest)
Find the best Box-Cox transformation of cost used to fit model_region_no_oldest. What value do you get?
Answer.
cost_bc, fitted_lambda = stats.boxcox(train['cost'])
round(fitted_lambda, 2)
sns.histplot(cost_bc, kde=True).set(xlabel='Box-Cox transformed auto insurance cost');
The fitted lambda is approximately 0.49, very close to a square-root transformation (lambda = 0.5), and the transformed cost distribution looks much closer to normal.
Refit model_region_no_oldest, but now with the transformation suggested by the Box-Cox fit. Call it model_region_no_oldest_bc.
Answer.
model_region_no_oldest_bc = smf.ols(formula = "I(((cost**0.49)-1)/0.49) ~ homeowner + car_age + risk_factor "
" + married_couple + C_previous"
" + I(age_youngest**2) + I(car_age**2) "
" + car_value * age_youngest "
" + duration_previous ",
data = train).fit()
model_region_no_oldest_bc.summary()
diagnosticPlot(model_region_no_oldest_bc)
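One caveat before comparing models: model_region_no_oldest_bc predicts on the Box-Cox scale, so its AIC/BIC are not directly comparable with the other models, and error metrics on the original cost scale require inverting the transform first. A minimal sketch of the inverse transform, assuming the fitted lambda of roughly 0.49 from above (the cost values are illustrative only):

```python
import numpy as np

lam = 0.49  # Box-Cox lambda as fitted by stats.boxcox above

def boxcox_transform(y, lam):
    # forward Box-Cox transform for lam != 0
    return (y ** lam - 1) / lam

def inv_boxcox(y_t, lam):
    # inverse Box-Cox transform for lam != 0
    return (lam * y_t + 1) ** (1 / lam)

y = np.array([100.0, 250.0, 600.0])  # illustrative cost values
y_back = inv_boxcox(boxcox_transform(y, lam), lam)
print(np.allclose(y, y_back))  # True
```

scipy.special.inv_boxcox implements the same inverse and could be used instead of the hand-rolled helper.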
df_Metrics = pd.DataFrame({'Model': ['model_all', 'model_sig', 'model_sig_plus', 'model_region',
                                     'model_region_no_oldest', 'model_region_no_oldest_bc'],
                           'Test': ['test', 'test', 'test', 'test2', 'test3', 'test3']})
df_Metrics['df_model'] = list(map(lambda x: eval(x).df_model, df_Metrics['Model']))
df_Metrics['df_resid'] = list(map(lambda x: eval(x).df_resid, df_Metrics['Model']))
df_Metrics['R2'] = list(map(lambda x: eval(x).rsquared, df_Metrics['Model']))
df_Metrics['R2_adj'] = list(map(lambda x: eval(x).rsquared_adj, df_Metrics['Model']))
df_Metrics['F_pval'] = list(map(lambda x: eval(x).f_pvalue, df_Metrics['Model']))
df_Metrics['AIC'] = list(map(lambda x: eval(x).aic, df_Metrics['Model']))
df_Metrics['BIC'] = list(map(lambda x: eval(x).bic, df_Metrics['Model']))
df_Metrics['Cond_Num'] = list(map(lambda x: eval(x).condition_number, df_Metrics['Model']))
df_Metrics
In this case, you practiced creating linear models with statsmodels and iteratively trimming the input variables, going from a model that includes every variable in the dataset to one that uses only the most relevant ones. You excluded variables that were statistically insignificant and removed those with high correlation. Finally, we performed some feature engineering to tame tail behavior that deviates from the normal distribution and improve the linear model's fit. The result is a very minimal model, built from the variables that other insurance companies use to charge premiums, that offers insight into how we can better serve a niche population.